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Abstract 


The  relationship  between  certain  regularization  methods  for  solving 
ill  posed  linear  operator  equations  and  ridge  methods  in  regression 
problems  is  described.  The  regularization  estimates  we  describe  may  be 
viewed  as  ridge  estimates  in  a  (reproducing  kernel)  Hilbert  space  H.  When 
the  solution  is  known  a  priori  to  be  in  some  closed,  convex  set  in  H, 
for  example,  the  set  of  nonnegative  functions,  or  the  set  of  monotone 
functions,  then  one  can  propose  regularized  estimates  subject  to  side 
conditions  such  as  nonnegativity,  monotonicity,  etc.  Some  applications 
in  medicine  and  meteorology  are  described.  We  describe  the  method  of 
generalized  cross  validation  for  choosing  the  smoothing  (or  ridge) 
parameter  in  the  presence  of  a  family  of  linear  inequality  constraints. 

Some  successful  numerical  examples,  solving  ill  posed  convolution  equations 
with  noisy  data,  subject  to  nonnegativity  constraints,  are  presented. 

The  technique  appears  to  be  quite  successful  in  adding  information,  doing 
nearly  the  optimal  amount  of  smoothing,  and  resolving  distinct  peaks 
in  the  solution  which  have  been  blurred  by  the  convolution  operation. 


Prepared  for  the  Proceedings  of  the  Third  Purdue  Symposium  on  Statistical 
Decision  Theory,  S.  S.  Gupta  and  0.  0.  Berger,  eds . 


1.  Introduction 


We  are  interested  in  the  Hilbert  space  version  of  constrained 
ridge  regression,  which  we  will  show  has  many  interesting  applications. 
The  (ridge)  regression  setup  is: 

■^nxl  *nxp^pxl  £nxl 

e  ~  W(0,a2I) 

3  ~  N(0,bE) 

where  X  and  I  are  known,  a2,  b  are  unknown.  A  "ridge-Stein"  estimate  of 
6,  call  it  3  ,  is  given  by  the  minimizer  of  0^(3), 

Qa(3)  =  l\ | y-XB |  |2  + 

where  ||-j|  is  the  Euclidean  mean.  If  A  is  taken  as  a2/nb,  then  it  is 
not  hard  to  show  that 


«  E(3|y).  (i 

If  it  is  known  that  6  is  in  some  closed  convex  set  C  in  Er  ,  then  one  may 
estimate  6  as  the  minimizer  of  Qa(B)  subject  to  the  const-  ieC.  Some 
interesting  C  are  those  determined  by  a  finite  number  of  linear  inequality 
constraints,  for  example  3j  >  0,  i  =  l,2,...,p,  or  3-|  >  6^  >  ...  >  6p. 

M.E.  Bock  discusses  a  related  setup  in  these  proceedings. 

We  particularly  want  to  allow  3  to  have  a  partially  improper  prior, 
for  example,  =  °°.  Then  is  defined  in  the  natural  way  and  will 
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then  not  be  of  full  rank.  This  causes  no  problem  provided  X  and 
Z-1  are  such  that 

Jg'X'Xe  +  AB'S^B  =  0  =»B  =  0.  (1.3) 

An  example  of  a  Hilbert  space  version  of  this  problem  (an  indirect 
sensing  experiment)  is 
1 

y(ti )  =  /K(t.  ,s)f(s)ds  +  e..,  i  =  1,2,... ,n,  0  <  t1  <...<  tp  <  1  (1.4) 

e  ~  N(0,cj2I) 

where  K  is  known,  f  is  known  to  be  in  the  Sobolev  space  W?(M?=  f  :f  ,f ' ,. . .  ,f ^ 
abs.cont.,  f^m^eL2[0,l]),see  Adams  (1975)),  and  a2  is  unknown.  A  so  called 
"regularized"  estimate  f^  of  f  is  given  by  the  minimizer  in  W™  of 

QAf)  =  J  l  (y(t.)  -  /K(t.,s)f(s)ds)2  +  x/(f(m)(s))2ds.  (1.5) 

A  ni=l  1  0  1  0 

Q  (f)  is  analogous  to 

A 

Qx(b)  =  l\  !y-xe| I2  +  ab'jTV 

i 

If  the  linear  functionals  f+f Kft^ ,s)f(s)ds  are  bounded  in  W™  for 
each  i  =  1 ,2,. . .  ,n,  and 

J  l  (/K(t.,s)f(s)ds)2  +  x/(f(m)(s))2ds  =  0  -*f  =  0  (1.6) 

ni=l  0  1  0 

then  Qx(f)  will  have  a  unique  minimizer,  call  it  f^,in  W™. 

If  f  is  endowed  with  the  zero  mean  Gaussian  prior  defined  by:  f  is 
/E>  times  an  unpinned  m-fold  integrated  Weiner  process  (Shepp  (1966)), 
with  a  diffuse  prior  on  the  initial  conditions,  then  it  can  be  shown  .  _ 

(Kimeldorf  and  Wahba  (1971),  Wahba  (1978)),  that  . 

:i  :/:.r 
■1 
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fx(t)  =  E{f(t)|y(t1),...,y(tn)),  (1.7) 

where  X  =  a2/nb.  This  prior  may  be  colloquially  described  as  "f^=white  noise". 

1  /_\ 

However,  with  this  prior  E/(fv  '(s))2ds  is  not  finite,  and  the  meaning 

0 

of  b  as  a  process  parameter  becomes  unclear  for  fcW^.  If  it  is  assumed 
that  feW-,  then  it  appears  to  be  more  appropriate  to  view  X  as  the 
"bandwidth  parameter"  which  governs  the  square  bias-variance  tradeoff. 

If  (1.6)  holds,  then  Qx(f)  will  have  a  unique  minimi zer  in  any 
closed  convex  set  C«H  (see  Wong  (1980),  Gorenflo  and  Hilpert  (1980). 

The  set  of  non-negative  functions  (f :f(s)>0,0<s<l}  is  closed  and  convex 
in  Hj  for  m  =  1,2,...,  and  the  set  of  monotone  increasing  functions 
{f :f 1 (s )>0 ,  0<s<l }  is  closed  convex  in  W™  for  m  =  2,3,...  .  See  also 
Wright  and  Wegman  (1980). 

We  are  interested  in  the  general  formulation  of  the  above  problem. 

The  model  is 


y.j  =  Lt  f  +  e.j ,  i  =  1 ,2,. . .  ,n 

where  it  is  known  that  feCcH,  where  H  is  a  given  Hilbert  space,  C  is  a 

closed,  convex  set  in  H,  and  Lt  ...Lt  are  n  continuous  linear  functions 

li  Si 

on  H.  J(-)  is  a  seminorm  on  H  with  an  m  dimensional  null  space,  and 
it  is  "believed"  that  J(f)  is  not  too  large.  We  propose  estimating 
f  as  the  minimizer  of 

Q,(f)  «  H  Ut  f-yJ2  +  XJ(f)  (1-8) 

x  ni=1  ri  i 


subject  to  feC. 
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If 

l  [  (Lt  f)2  +  XJ(f)  =  0 
ni=l  ri 

r 

=»f  *  0,  then  there  will  be  a  unique  solution,  call  it  .  We  will 
refer  to  this  solution  as  the  constrained  regularized  estimate,  sometimes 
dropping  the  superscript  C. 

There  are  now  two  problems.  One,  given  X,  how  does  one  compute  a 
r 

good  approximation  to  f  ,  and  two,  how  does  one  estimate  a  good  value 
of  X.  In  many  interesting  cases,  when  H  is  a  reproducing  kernel  space, 
the  constraint  set  C  can  be  discretized  in  a  convergent  way,  see  Wahba 
(1973).  For  example,  the  minimizer  of  Q^(f)  subject  to  fcC  =  (f :f(s)>0,0<s<l} 
is  well  approximated  by  the  minimizer  of  Q  (f)  subject  to  feC,  =  (f: 

f  ^ 

f(j)>0,i  =  1 ,2 ,. . .  ,L)  for  H  =  Wj,  J(.)  =  /(f(m)(s))2ds,  L  large.  If  CL  is  any 
(closed)  set  defined  by  L  linear  inequality  constraints,  the  problem  of 
minimizing  Q^(f )  subject  to  feCL  can  be  reduced  to  a  quadratic  programming 
problem  with  linear  inequality  constraints  in  at  most  n  +  m  +  L 
variables.  See  Kimeldorf  and  Wahba  (1971).  The  researcher  interested 
in  numerical  methods  for  this  and  related  problems  may  consult  Anselone 
and  Laurent  (1968),  Utreras  (1979),  Wahba  (1978,  1980a,  1980b,  1981), 

Wahba  and  Wendelberger  (1980).  (The  formulae  in  Kimeldorf  and  Wahba  are 
inappropriate  for  computational  purposes.)  Remarks  concerning  the  effect 
of  quadrature  in  this  setting  may  be  found  in  Wahba  (1981).  Library 
software  for  solving  the  quadratic  programming  problem  by  the  principal 
pivoting  method  is  available,  for  moderate  n  +  m  +  L,  see  MACC  (1979). 

We  will  go  through  a  relatively  simple  example  in  Section  4. 

Our  main  interest  in  this  paper  is  the  development  of  a  method 
for  choosing  X  which  is  suitable  for  the  constrained  problem. 


& 


In  this  paper  we  propose  an  extension  of  the  generalized  cross 
validation  (6CV)  method,  to  the  constrained  case.  This  method  was 
proposed  in  the  unconstrained  case  in  Craven  and  Wahba  (1979), 

Golub,  Heath  and  Wahba  (1979),  and  Wahba  (1977).  The  GCV  estimate  of 
X  we  propose  in  the  constrained  case  can  be  expensive  to  compute.  Thus 
we  propose  a  first  order  approximation  to  it  which  is  very  much  cheaper 
to  compute,  and  appears  to  be  satisfactory  in  the  examples  we  tried. 

We  experimentally  tested  the  constrained  regularization  method  with 
the  approximate  GCV  estimate  of  X  on  a  convolution  equation  with  several 
simulated  data  sets  generated  according  to  the  model  (1.4)  with  non- 
negative  f's.  For  comparison,  we  first  estimated  f  by  minimizing 

O  ~ 

QA(f )  in  W2  and  using  the  (usual)  unconstrained  GCV  estimate  X  for  X. 

We  then  estimated  f  by  minimizing  Q^(f)  in  Cn  where  Cn =  { f : f ( - ) >0 , i « 1 .2 s . . . 
and  choosing  X  by  the  approximate  GCV  method  for  constrained  problems. 

The  constrained  estimates  with  the  approximate  GCV  choice  of  X  were  all 
dramatic  improvements  over  the  unconstrained  estimates.  As  a 
practical  matter,  they  displayed  a  remarkable  ability  to  resolve 
closely  spaced  peaks  in  the  solution  that  have  been  blurred  in  the  data 
by  the  convolution  operation.  The  convolution  equation  is  ill  posed, 
and  the  positivity  constraints  are  apparently  supplying  much  needed 
information.  Three  cases  of  the  exact  GCV  method  for  constrained  problems 
were  tried  for  choosing  X.  It  gave  a  very  slightly  better  (and  possibly 
more  stable)  estimate  of  the  optimal  X.  However  it's  much  more  expensive 
to  compute. 
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2.  Some  Applications 

i)  Meteorology 

In  recent  years  several  satellites  have  been  put  in  orbit  which  carry 
detectors  which  measure  the  upwelling  radiation  at  selected  frequencies. 

The  observed  radiation  at  frequency  v,  when  the  subsatellite  point  is  P, 
may  be  modelled  (after  some  linearization  and  approximation)  as 

Up)  =  /K  (P ,P ’  )T(P '  )dP' , 

ftp 

where  P'  is  a  point  in  the  atmosphere,  Qp  is  the  volume  within  the 
detector  field  of  view  when  the  subsatellite  point  is  P,  T(P')  is  the 
atmospheric  temperature  at  point  P'  and  is  determined  from  the 
equations  of  radiative  transfer.  See  for  example  Fritz  et  al  (1972), 

Smith  et  al  (1979),  Westwater  (1979).  It  is  desired  to  estimate  T(P) 
to  use  as  initial  conditions  in  numerical  weather  forecasting.  Occasionally, 
outside  information,  such  as  the  existence  of  a  temperature  inversion, 
is  available,  thus  providing  some  inequality  conditions  on  the  derivative 
of  T(P)  in  the  vertical  direction. 

ii)  Computerized  Tomography 

Computerized  tomography  machines  are  in  most  well  equipped  hospitals. 
Computerized  tomography  machines  observe  line  (or  more  accurately,  strip) 
integrals  of  the  X-ray  density  f  of  parts  of  the  human  body,  and  from 
this  data 


y,  =  /f(P)dP  +  e, , 
l* 


1 ,2,. . .  ,n. 


estimates  of  f(P)  are  made.  Algorithms  for  estimating  f  must  be  capable 


of  dealing  with  n~10s,  see  Herman  and  Natterer  (1981),  Shepp  and 
Kruskal  (1978).  The  true  f  is  non-negative. 

iii)  Stereo! ogy 

Scientists  studying  tumor  growth  feed  laboratory  mice  a  carcinogen, 
Sacrifice  the  mice,  and  then  freeze  and  slice  the  livers.  Images  of  the 
liver  slices  are  magnified  and  areas  of  tumor  cross  sections  are 
measured.  It  is  expensive  to  examine  the  liver  slices,  thus  it  is 
desired  to  take  a  sample  of  the  possible  slices  and  from  the  resulting  data 
infer  numbers  and  (three  dimensional)  size  distributions  of  tumors  in 
the  entire  liver  from  data  from  a  few  slices.  In  the  "random  spheres" 
model,  the  tumors  are  assumed  to  be  spherical  with  the  radii  density  f(s). 
If  the  slices  are  "random"  then  the  cross  sectional  (two  dimensional) 
density  g(t)  is  related  to  f  by 


See  Anderssen  and  Jakeman  (1975),  Watson  (1971),  Wicksell  (1926). 

This  setup  does  not  fit  into  the  model  (1.4)  because  i)  in  theory  a 
random  sample  from  the  population  with  density  g  is  observed  (not 
g(t.j )+£.{)  and  ii)  in  practice  the  liver  is  embedded  in  a  paraffin  block 
and  sliced  systematically  perpendicular  to  an  axis  which  (roughly) 
maximizes  the  cross  sectional  area  of  the  liver  being  sliced.  Nonetheless, 
it  is  fruitful  to  think  of  this  problem  in  the  context  of  ill  posed 
integral  equations  (see  Anderssen  and  Jakeman  (1975),  Nychka  (1981)). 
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iv)  Convolution  Equations 

Convolution  equations  in  one  and  higher  dimensions  arise  in  many 
areas  of  physics.  See,  for  example  Chambless  (1980),  Davies  (1979). 
These  equations  can  be  surprisingly  ill  posed. 

v)  Other  applications 

Other  applications  may  be  found  in  the  books  of  Anderssen,  DeHoog 
and  Lukas  (1980),  Golberg  (1978),  Tihonov  and  Arsenin  (1977),  Twomey 
(1977),  Washed  (1981). 
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3.  Cross  validation  for  constrained  problems 

We  first  define  the  ordinary  cross  validation  (OCV)  or  "leaving 

out  one"  method  of  choosing  X. 

fkl 

Let  L.j  =  ,  and  let  f^  J  be  the  minimizer  of 

l  I  +  XJ(f)  (3.1) 

ni=l  1  1 

i+k 

subject  to  feCcH,  where  we  assume  sufficient  conditions  on  the  {L^ 
and  J(0  for  existence  and  uniqueness.  A  figure  of  merit  can  be  defined 
for  X  by 

v»  ■  IS  (3-2) 

Tkl 

where  LkfAL  J  is  the  prediction  of  yk  given  the  data  y1 . y|<_1  ,yk+1 ,. . .  ,yp , 

and  using  X.  The  OCV  estimate  of  X  is  the  minimizer  of  VQ(x).  In  the 
unconstrained  ridge  regression  case  this  estimate  is  known  as  Allen's 
PRESS  (see  Hocking1 s  discussion  to  Stone  (1974)).  The  names  of  Mosteller 
and  Tukey  (1968)  Geisser  (1975),  M.  Stone  (1974)  and  others  are  associated 
with  early  work  on  ordinary  cross  validation.  See  also  Wahba  and 
Wold  (1975).  In  the  ridge  regression  case  the  OCV  or  Allen's  PRESS 
has  the  undesireable  property  of  not  being  invariant  under  arbitrary 
rotations  y-*Ty  of  the  data  space.  If  one  observed  Ty  instead  of  y 
the  OCV  estimate  of  X  may  be  different.  GCV  (to  be  defined  below)  may 
be  thought  of  as  a  rotation  invariant  version  of  OCV,  for  which  some  good 
theoretical  properties  may  be  obtained.  For  further  discussion  see 
Craven  and  Wahba  (1979),  Golub,  Heath  and  Wahba  (1979),  Wahba  (1977), 

Utreras  (1978),  Speckman  (1981). 

To  extend  the  definition  of  the  GCV  estimate  of  X  to  constrained 
problems,  we  will  use  the  Theorem  given  below. 


-10- 


Theorem:  Let  H  be  a  Hilbert  space,  J(-)  a  semi  norm  on  H  and  L-j,...,Ln 
be  n  continuous  linear  functionals  on  H,  with  the  property,  that  for  any 
fixed  X  >  0, 

Jj  (L.f)2  +  XJ(f)  =  04f  =  0 
i  =  l  k  =  1  2  n 

ifk  K  i  ,n. 

Let  C  be  a  closed  convex  set  in  H  and  let  f^k^[z]  and  f^[z]  be  the 
minimizers  in  C  of 

sJ,(Lif‘z1)Z  +  XJ(f) 

ifk 

and 

sj/h <f-V!  +  XJ<f>- 

respectively,  where  z  =  (z-j ,. . .  ,zn) ' .  Then 

fA[y+sk]  =  fA[k][y],  k  =  l,2,...,n 

where  6k  =  (0,. . .  ,0,Lkfx^[y]-yk,0,. . .  ,0) '  ,  (the  non  0  entry  is  in  the 
kth  position. 


Remark:  This  theorem  says,  that  given  data 

*1 


^k-l 

Lkfx[k][y] ! 

»k*l 

\  y„  / 


the  minimizer  of  QA(f)  in  C  is  f 


[k] 


(3.3) 
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Proof:  Proofs  in  special  cases  may  be  found  in  Craven  and  Wahba  (1979) 
and  Golub,  Heath  and  Wahba  (1979).  A  proof  in  the  generality  cited  here 
is  in  Wahba  (1980c),  although  no  doubt  the  result  is  a  special  case 
of  classic  optimization  theory  results. 

Now  define  the  "differential  influence"  of  yk  when  A  is  used,  by 
ak*k(A), 


kk 


(A)  = 


LkfA[^k]-LkfA^ 


(3.4) 


where 


Sk  '  LkfX[kl[yl-V 


(3.5) 


ak*k(A)  is  a  divided  difference  of  Lkf^  considered  as  a  function  of  the 
kth  data  point  (and  is  well  defined). 

Applying  Lk  to  both  sides  of  (3.3)  and  substituting  the  result 
into  (3.4)  and  (3.4)  into  (3.2)  gives  the  identity 


v*> -  a 


_  i  ?  (LkVyk>2 


0  nk=i  (l-ak*k(A))2 
The  GCV  estimate  of  A  is  obtained  by  replacing  ak*k(A)  in  (3.6) 

1  n 

by  the  "average  differential  influence"  -  l  ak*k(A),  that  is,  the  GCV 

k — 1 

r 

estimate  of  A  is  obtained  by  minimizing  V ( A )  =  V  (A)  defined  by 

,  nJ1(LkV)'k)! 

VC(X)  • 


(3.6) 


(3.7) 


Some  properties  of  this  estimate  in  the  unconstrained  case  are 
known.  First,  in  the  unconstrained  (C=H)  case,  Lkf  [y]  is  linear  in  y, 
and  there  exists  an  influence  matrix  A(A)  with  the  property 


In  this  case  ( X) ,  the  divided  difference  of  L^f^  with  respect  to 
yk  +  6^  and  yk,  is  also  the  first  derivative 


=  a 


kk 


U) 


where  akk(X)  is  the  kkth  entry  of  A(X).  Then  V(X)  can  be  written 

%  fjl  i I-A(X)y|  |2 

V(X)  =  K - 

CjTrd-A(X)))2 


(3.8) 


To  understand  the  known  (and  potentially  obtainable)  properties 
of  the  GCV  estimate  of  X  we  will  first  compare  it  with  the  unbiassed 
risk  estimates  of  Stein  (see  Hudson  (1974),  Mallows  (1973)). 

Let  L(f,X)  be  the  predictive  mean  square  error  when  X  is  used 

L<f-X>  •  s j,<LkVLkf)! 

■  ;jl  |AU)y-g|  I* 

where  g  =  (Ljf .  .L^f)'  =  Efy. 

If  a2  is  known  (or  an  unbiassed  estimate  of  it  is  avai lable) then  an 
unbiassed  estimate  R(x)  of  R(X)  =  EfL(f,A)  =  ^ 1 1  ( I -A( x) )g j  | 2  +  ^-TrA2(x) 
is  available  and  is  given  by 

R(x)  =  l||(I-A(x))y||2-  ^Tr(I-A(x))2  +  ^TrA2(x), 

this  corresponds  to  Mallows'  C^,  see  Mallows  (1973),  Craven  and  Wahba  (1979). 
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To  talk  about  convergence,  consider  a  family  Lt>  te[0,l]  of  continuous 

linear  functionals  on  H,  with  L.  ,...,L.  a  subset.  Let  K  be  the 

rl  tn 

operator  which  maps  H  into  the  real  valued  functions  on  [0,1]  by  (Kf)(t)  =  Ltf, 
Loosely  speaking,  if  K(H)  is  a  reproducing  kernel  space  with  sufficiently 
smooth  reproducing  kernel,  then  as  tj,...,tn  become  dense  in  [0,1], 

EfV(X)*EfL(f,X)  +  a2 

for  A  in  the  neighborhood  of  the  minimizer  of  E^L(f,A).  See  Wahba 

(1977).  Under  various  circumstances  it  can  be  shown  (Craven  and  Wahba  (1979)),  that 

EfL(.f,X) 

rm'n~E.L(f  ,A)  1  1  as  feH  (3-9> 

A 

where  A  is  the  minimizer  of  EfV(A).  Utreras  (1978)  and  Speckman  (1981) 
have  recently  rigorized  and  strengthened  these  results. 

In  general  for  (3.9)  to  be  true  one  appears  to  need  that  y.|(A)-»0 
and  u12(A)/y2(A)->0  for  A  in  the  neighborhood  of  A*  where 
yi'(A)  =  |jTr^(A)  ancl  **  is  t^ie  minimizer  of  E^L(f,A).  Intuitively, 
this  means  that  the  signal  must  be  concentrated  in  a 
small  "corner"  of  the  data  space  En>  Optimal  rates  of  convergence  for 
f^  corresponding  to  those  given  by  C.  Stone  (1980)  can  be  obtained  in 
some  cases  Craven  and  Wahba  (1979),  Wahba  (1977a,  1977b,  1979b),  Lukas  (1981). 

We  now  return  to  the  constrained  case,  feC  .  We  consider  only  the 
case  where  C  is  (or  is  well  approximated  by)  the  intersection  of  a 
finite  number  of  half-spaces, 


CL  =  {f:N£f>a(H) ,  l  =  1,2,...,L], 
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where  the  are  continuous  linear  functions  on  H.  Even  in  this  special 
case  it  appears  that  to  evaluate  V(X)  of  (3.7)  for  a  single  A  one  must 
solve  n  quadratic  programming  problems  in  as  many  as  n  +  m  +  L  variables. 
To  avoid  this  we  propose  the  following  approximation: 


Replace  the  divided  difference 


ak*k<x>  ■ 


6,. 


(3.10) 


by  the  derivative 


akk^  =  SyT^A^Iy- 


Thus  V(A)  of  (3.7)  is  replaced  by  \pproxM  =  vapprox^)  defined  by 

fii1(LkfA-J'k>2 

Vapprox^  Tn  I  * 


(3.11) 


(3.12) 


For  each  A,  V  (A)  can  be  obtained  by  solving  one  quadratic  optimization 
approx 

problem.  We  outline  the  procedure,  for  more  details,  see  Wahba  (1980b) 
and  the  example  in  Section  4.  First,  solve  the  quadratic  optimization 
problem  to  obtain  f^  and  determine  which  constraints  are  active.  Suppose 
these  correspond  to  N0  ,N0  ,...,N0  .  f,  is  then  also  the  solution  to 

Jo*!  *2  a 

the  quadratic  optimization  problem:  Minimize  Q^(f)  subject  to  N£  f  = 
a(^)>  i  =  1,2,..., I1,.  The  solution  to  this  latter  problem  is  linear  in 
y  and  is  related  to  the  data  through  an  influence  matrix,  call  it  Al,(a). 


(3.13) 


AL. (A)  is  given  explicitly  in  Wahba  (1980b),  see  also  below. 

The  ingredients  for  computing  TrA^,(A)  will  generally  have  been  obtained 

in  the  process  of  setting  up  and  solving  the  quadratic  optimization  problem. 

Unfortunately  Lf.|„  may  be  only  piecewise  well  defined  and 
k  a  y 

continuous  in  A.  If  a  change  in  A  causes  a  change  in  the  active 

constraint  set,  then  one  or  more  of  the  -r2—  Lf.l  may  have  a  jump. 

°yk  k  a  y 

This  can  be  seen  in  the  examples  in  Section  4  and  is  the  major  drawback 
of  the  method.  The  exact  cross  validation  function  V( A )  of  (3.7) 


appears  to  be  a  continuous  function  of  A  for  A  >  0. 


> 
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4.  Numerical  Experiments 


We  numerically  studied  convolution  equations  with  the  model 

1  * 

y,-  =  /k(n"s)f(s)ds  +  £,•»  i  =  l,2,...,n,  n  even. 

0  n  1 

f(s)>0,0<s<l , 

With  J(f)  =  j' (f(m) (s) ) 2ds .  The  constraints  will  be  discretized  to 
0 

f(-j)  >  0,  i  =  l,2,...,n.  To  simplify  the  calculations  while  retaining  many 
of  the  features  of  the  original  problem  we  assumed  that  k(-)  and  f(-) 
were  both  in  the  n  dimensional  subspace  F  of  spanned  by 

{1  ,cos2-rr\)t,  v=1 ,2 ,. . .  ,n/2 ,  sin2iTvt,  v=l  ,2 ,. . .  ,n/2-l }. 


Thus  all  functions  in  Fn  are  periodic  and  the  null  space  of  J(-)  in 
Fp  is  spanned  by  the  single  function  "1".  Also,  f  and  k  are  of  the  form 


where 


n/2-1 


n/2-1 


o  “i  V  _  f:  n  V 


V=1 

n/2-1 

9-  l  l 

V=1 


V=1 

n/2-1 

ni  r 

v=l 


1*/V 


1../1  ■ 


1../1 


i=l 


i  =  l 


^2C0S 

(4.1) 

;n/2cos7rnt 

(4.2) 

;> 

(4.3) 

(4.4) 

* 

/ 


We  have 
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1 

g(t)  =  /k(t-s)f{s)ds 
0 

n/2-1 

=  ^nan  +  2  I.  (avCv'Bvnv)cos27Tvt 


v=l 


n/2-1 

+  2  (avV^)s1n2wt 


+ian/2^n/2COS7Tnt’ 


(4.5) 


and 


n/2-1  „ 

J(f)  =  2  l  (a2+6fi)(2TTv)2m  +  (l/2)a2/9Un) 
v=l 


2m 


v  "v'  ' '  '  ■'  n/2' 


(4.6) 


f, ,  the  minimizer  in  F  of 
A  n 


QXif)  =  n.?1(lk("‘S)f(S)dS'yi)2  +  A](f(m){s))2ds 


is  given  by 


(4.7) 


n/2-U  n/2-U 

f^(t)  =  oq  +  2  l  a  cos2irvt  +  2  £  6vsin2Trvt 
v=l  v  v=l 


+  an/2 


cosirnt 


(4.8) 


where 


a0 

a0^0 

A 

OL 

1 

(a  £  -b  n 

V 

£2+n2+AA 

V  V  V  * 

V  V  V 

A 

6 

1 

(a  n  +b  C 

V 

£2+n2+AA , 

V  V  V^’ 

V  V  V 

A 

.1 

an/2 

iCn/2+AAn/2 

an/2^n/2 

v  =  1,2,... , n/2-1 


(4.9) 


Ay  =  (27iv)‘ 


(4.10) 


a  =  ■  [  cos27rvjyi  v=0,l ,. . .  ,n/2 
v  nj=1  n  j 

1  n  i 

b  =  -  l  sin2iTv*y.  v=l  ,2,. . .  ,n/2-l , 
v  nj=1  n  o 


The  cross  validation  function  V( A)  of  (3.8)  in  the  unconstrained 


case  becomes 


n/2-1  AA  2  AA„/9  2 

,2  Ji  C^+V  (a^,+c^vF] 


2n/|-]  XXv  1  XXn/2  ^ 2 
"v= 1  5v+nv+XXv  "^+XXn/2 


(4.11) 


(4.12) 


In  principle  m  can  be  chosen  by  cross  validation  (see  Gamber  (1979), 

Wahba  and  Wendelberger  (1980).  In  these  experiments  we  have  (arbitrarily) 
set  m  =  2. 

To  study  the  constrained  case  we  write  this  problem  as  follows: 
Letting  x  =  (f (^) . .  ,f (■■) ) ' ,  we  have 


Q.(f)=  I | KWx-Wy| | 2  +  Ax'W'JWx 


(4.13) 


where  the  nxn  matrices  K,  J  and  W  are  given  by 
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\  '  ^  sn/2-l  ' 


where 

c0  =  fi^1  ’*  *  ’ 

11?  n 

Cv  =  f|(cos2^  >  cos2ttv^,.  . .  ,cos2ttv2) 

7  1?  n 

=  ^(sin2iru|i  ,  sin2nv^ .  ,sin2iTV^)  . 

Note  that  WW'  =  Jl. 

We  let  be  the  minimizer  of  (4.13)  subject  to  x  >  0.  The  program 

QUADPR  in  the  Madison  Academic  Computing  Center  Library  (MACC,  1977) 

was  used  to  find  x  to  minimize  the  right  hand  side  of  (4.13)  subject  to  x  >  0. 

This  code  employs  the  principal  pivoting  method  of  Cottle  (1968).  Call  the 

minimizer  x.  .  Letting  the  ith  component  of  x,  be  x,(i),  the  indices 


i ^ , . . .  ,i L ,  for  which  x^(i)  >  0  are  determined. 


Let  E  be  the  n  x  L‘  indicator  matrix  of  these  indices,  that  is,  E  has  a  1 
in  the  ith  row  and  jth  column  if  i  =  i.,  j  =  1,2,...,L',  and  zeroes 

J 

elsewhere.  The  solution  to  the  problem:  minimize 


| | KWx-Wy | |2  +  Xx'W'JWx 

subject  to  x(i)  =  0  for  i  not  one  of  i^,.  -  - , i L •  is 

x,  =  E(E'W,K‘KWE+AE'W,JWE)"1E,W,K,Wy 

A 


(4 


Defining  by 


gC(t)  =  /k(t-s)fRs)ds 


where  f^eFn  satisfies  (f^(jj) ,. . .  ,f^(-j) )  =  x^,  we  have  L^f^  =  g^)  >  and 


C  =  nO! 


L  fC\ 
L1TX  ' 


LnfX  / 


=  nW'  KW  x^  =  Al, (*)y 


(4 


where 

Al,(A)  =  nW 1 KWE ( l j ) " ^  E ' W 1 K 1 W , 

with 


=  E'W'K'KWE,  =  E’W'JWE. 

Therefore  (provided  all  i  for  which  x^(i)  =  0  are  active  constraints!) 
we  have 


n 


n 


3LkfX 


=  Tr(I-AL,(X)) 


=  n-L'+XTrB 
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where 

B  =  Ij(Ix+xIj)  1 » 

Q 

and  the  approximate  cross  validation  function  vapprox(A)  =Vapprox^ 
is 


VC 

approx 


(A) 


| | KWx^-Wy | |2 
(I(n-L'+ATrB))2 


(4.16) 


TrB  =  is  computed  by  first  using  UNPACK  (Dongarra  et  al 

(1979))  to  solve  L1  linear  systems  for  B  defined  by 


(Ik+a!j)c  =  Ij 

and  then  computing  TrB. 

We  pause  to  caution  the  reader  that  roundoff  error  lurks  everywhere 
in  calculating  with  ill  posed  problems  (as  this  will  be  if  k  is  at  all 
"smooth"),  am  calculations  must  be  done  in  double  precision  and  care 
must  be  taken  with  such  simple  quantities  as  ||u-v||2  (don't  compute 
(u,u)-2(u,v)+(v,v)! ). 

To  get  a  nice  example  function  h  in  Fp  for  our  Monte  Carlo  study, 
we  began  with  a  convenient  analytically  defined  function  hQ0(t)  with  hoo(0):hoo(l) , 
constructed  a  function  hQ(t)  satisfying  hQ(0)  -  hQ(l)  by  setting 

ho(t>  ■  hoo(t)  *  <l’oo<°>-hoo<1»t  *  2<hoo("-hoo<°>>- 

Then  we  took  as  our  example  function  h  the  trigonometric  interpolant  to  hQ 
via  (4,l)-(4.4).  For  n  =  64  the  hQ0  and  h  we  used  as  example  functions 
cannot  be  distinguished  visually  on  a  x  11  plot.  For  our  examples  we 
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3 


constructed  k  and  several  f‘seFn  from  k  and  the  f  's  given  below: 

n  oo  oo 


koo(t)  =  — —  e‘t2/2s2  +  e-(l-t)2/2s2  ^  s  =  <043 

,  .  -(t-.3)2/2s2  2  ,  -(t-y)2/2s2 

■  3  —  e  1  +  5  —  «  2 


V^S, 


where 


s1  =  .015,  s2  =  .045 

and  four  different  f's  were  generated  by  letting  the  peak  separation 

1 

p-.3  be  as  in  Table  1.  In  each  example  g(t)  =  Jk(t-s)f(s)ds  is  computed 

i  1  0 

from  (4.3)-(4.5)  given  k(-),  f(~)  for  i  =  l,2,...,n.  Figure  1  gives  a 
plot  of  k(t). 


I 


* 

4 


A 
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Table  1 


Example 

Peak  separation 

Domain 

1  RANGE 

1 

.2 

1.005 

1.002 

2 

.15 

1.016 

1.081 

3 

.10 

1.224 

1.081 

4 

.05 

6.650 

1.318 

* 

4 


Figure  1.  The  convolution  kernel  k(t). 


1 

Figures  2a,  3a,  4a  and  5a  aive  f(t),  g(t)  =  /k(t-s)f(s)ds ,  and 

i  *  0 

yi  =  g(-)  +  c-,  for  examples  1-4,  where  the  ei  were  i.i.d.  W(0 ,o2) 

pseudo  random  variables  with  o  =  .05.  Figures  2b,  3b,  4b  and  5b 
C 

give  f,  f?  and  f?  for  these  same  4  examples.  X  is  the  minimizer  of 

A  Aq 

V( A)  for  unconstrained  problems  given  by  (4.12)  and  computed  by  evaluating 
V(a)  at  equally  spaced  increments  in  log^A,  performing  a  global  search, 
evaluating  V( A )  at  a  finer  set  of  equally  spaced  increments  centered  at 

the  previous  minimum  etc.  The  final  search  is  performed  on  V ( A )  evaluated 

1  /n  r 

at  increments  of  g  in  logA.  A  is  the  minimizer  of  vapprox(^)  (4.16). 

r 

In  these  examples  the  minimum  was  found  by  evaluating  vapprox(^)  at  values 

/V 

of  A  satisfying  logA-logA  =  j(.l)  for  j  =  0 ,±1 , . . .  ,etc.  The  possible 
perils  of  this  process  will  be  discussed  later. 

In  each  example,  a  "ringing"  phenomena  in  the  unconstrained  solution 
is  very  evident.  Intuitively,  the  approximate  solution  retains  some  high 
frequency  components  in  an  attempt  to  capture  the  two  narrow  peaks.  In 
each  of  the  four  examples  the  imposition  of  positivity  constraints  provided 
a  dramatic  improvement  in  the  solution.  Anyone  who  has  attempted  a 
numerical  solution  of  an  ill  posed  problem  knows  that  the  visual  character 
of  the  solution  can  vary  significantly  with  A  (and  to  a  lesser  extent 
with  m,  given  the  optimal  A  for  that  m. )  In  the  unconstrained  solutions, 
the  cross  validation  estimate  of  A  was  near  optimal  in  Examples  1  and  2, 
good  in  Example  3  and  poor  (from  the  point  of  mean  square  error  of  the 
solution)  in  Example  4.  The  data  behind  this  remark  are  given  in 
Table  1.  The  inefficiencies  I^oMAIN  anc*  *RANGE  in  that  tat^e  are  defined 


Figure 


Figure  3.  f,  g,  data,  and  f*  for  Example  2,  peak  separation  =  .15. 


Figure  4.  f,  g,  data,  fc  and  f?  for  Examp 1 


for 
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S(yfx<S>-f<S»: 


DOMAIN  ,  n  .  .  . 

1" 


1 RANGE 


sj,(9J(5,-9(S,)' 


The  theory  (Equation  (3.9))  concerning  the  GCV  estimate  A  says  (roughly) 
that  =  (l+o(1))  as  n-**>. 

We  now  discuss  Example  3  in  greater  detail.  Figure  6  gives  the 
mean  square  error  of  f^,  f^,  g^  and  g^  as  a  function  of  A. 

(MSE(f^)  =  1  ^ (f^(^)-f(^))2,  etc.).  We  have  taken  the  origin  as 

A  A 

logA(logA=-9.889) .  Since  the  GCV  estimate  of  X  estimates  the  minimizer 
of  MSE(gx)  or  MSE(g^),  it  will  generally  be  a  good  estimate  of  the  minimizer 
of  MSE(f^)  or  MSE(f^)  to  the  extent  that  MSE(f^)  and  MSE(g^),  or 

c  c 

MSE(f^)  and  MSE(g^)  have  the  same  minimizer.  The  minimizers  of  the 

four  curves  are  marked  by  arrows.  In  these  and  other  cases  we  have  tried 

(ne[30,100],  smooth  f,  a  a  few  percent  of  max(g(t) (),  the  optimal  \  for 

t 

MSE(f^)  and  MSE(g^)  appear  to  be  close,  as  a  practical  matter.  As  a 
theoretical  phenomena  for  large  n  it  may  or  may  not  be  true,  see  Lukas 
(1981)  for  some  asymptotic  results  on  the  optimal  X  for  different  loss 
functions  in  the  unconstrained  case. 


AtfilriiMftiMMil 
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-3.00  -2.50  -2.00  -1.50  -1.00  -.50  .00  .50 

A  . 


logX  -  logX 

Figure  6.  Comparison  of  mean  square  error  of  estimates  of  f  and  g,  as  a  function  of  X 


Figure  7  gives  V(X)  of  (4.12),  / _ (X)  of  (4.16)  and  VC(x)  of 

approx 

r 

(3.7)  for  Example  3.  V(x)  and  vapproxU)  were  computed  at  increments  of  .1 

~  C 

in  logX.  Xj.  was  taken  as  the  global  minimizer  of  the  computed  vapprox 

p  A  A 

values.  V  and  V,npi4.rtv  at  their  respective  minimizers  X  and  Xr  are 
approx  L 

marked  by  a  large  *.  In  Figure  6,  the  corresponding  MSE  values  at 

/\  A 

X  and  Xc  are  also  marked  by  a  large  *.  In  Figure  7,  some  of  the  computed 
r 

values  of  Vapprox  have  been  connected  by  a  smooth  curve.  Two  adjacent 

points  have  not  been  connected  if  the  set  of  active  constraints  is  different 

r 

for  the  two  corresponding  values  of  X.  vapprox  can  &e  expected  to  have 
at  least  one  discontinuity  somewhere  between  the  two  corresponding 

A 

values  of  X,  (including  the  end  points).  Although  the  estimates  Xc 
worked  well  in  this  and  the  other  three  examples  tried,  there  are  obvious 
pitfalls  in  minimizing  a  discontinuous  function,  e.g.  sensitivity  to  the 
increment  in  logX. 

We  decided  to  invest  a  fair  amount  of  computer  time  to  compute 
r 

V  (X)  for  this  one  example.  The  computed  values  are  indicated  by  □ 

A 

in  Figure  7.  The  computation  was  attempted  for  logX-logX  from  -3.00  to 
.6  in  steps  of  .1.  There  are  missing  values  whenever  the  quadratic 
optimization  routine  QUADPR  terminated  with  an  error  message.  This 
happened  during  the  constrained  minimization  of  the  leaving  out  one 
version  of  (4.13)  in  the  process  of  calculating  a^  of  (3.4),  for  some  k 
(typical  error  message;  "no  complement  variable  found").  Nevertheless  it 
appears  possible  to  connect  the  computed  values  by  a  smooth  curve  and 

A 

find  the  minimum  by  a  global  search  in  a  neighborhood  about  or  below  X. 


J-LLliU  JjJ 


r 

V  at  its  global  minimizer  is  marked  byQ  in  Figure  7,  and  the  MSE 
C  C 

curves  for  and  g^  in  Figure  6  are  also  marked  by  a  □  at  the  minimizer 
r 

of  V  .  Out  of  concern  for  the  computational  failures  with  QUADPR  noted 

above,  it  was  decided  to  try  this  example  for  n  =  50.  The  difficulty 

of  the  quadratic  program  increases  with  n.  Two  replications  were  tried, 

c  c 

In  the  first,  V  (X)  as  well  as  vapproxU))  was  successfully  computed 

/v 

for  logA-logX  in  steps  of  .1  from  -2.4  to  .6.  The  CPU  time  for  n  =  50 
50  C 

was  around  3 )  times  that  for  n  =  64.  V  (X)  was  visually  smooth 

and  convex  near  its  minimum  when  plotted  to  the  same  scale  as  Figure  7 

r 

(equivalently,  to  3  but  not  4  significant  figures).  vapprox  showed 
the  same  apparently  piecewise  continuous  behavior  as  in  the  example  for 

A 

n  =  64.  Both  functions  had  their  global  minimizers  at  logX-logX  =  -.7 

c  *  c 

while  MSE(f^)  was  minimized  at  logX-logX  =  -.8,  for  an  IqqmAIN  of  1.009 

c  c 

^DOMAIN  is  defl’ned  analogously  to  1  DOMAIN  w-ith  f  rePlaced  by  f  »  etc.) 

r 

In  the  second  replication  the  computation  of  a  V  (X)  for  a  few  scattered 
values  of  X  terminated  in  an  error  message  but  nevertheless  a  minimum 

c  c 

of  V  (x)  was  easily  found,  and  resulted  in  IDqmain  of 

The  innocuous-looking  convolution  equation  we  have  studied  here  is 
very  ill  posed,  a  phenomena  surprisingly  comnon  in  many  experiments. 

We  may  write 


y  =  nW'KWx  +  e, 

thus  the  design  matrix  X  is  nW'KW.  If  k  is  symmetric  (as  it  is  here), 
then  the  r^'s  are  all  0  and  K  is  diagonal.  Table  2  gives  the  £v's  of 
(4.2)  and  (4.13),  which  are  also  the  singular  values  of  the  design  matri 

S  ... ,Cn/2_i  are  of  multiplicity  2.  Also  given  in  Table  2  are  the 

^  *  * 

a  ,  6V,  ay  and  defined  by  (4.3)  and  (4.9),  with  x  =  X.  If  £v  is 


Singular 

Fourier  coefficients  of  f  Fourier  coefficients  of  f~  values  of  X 

A  (Eigenvalues  of  K) 


V 

a 

6 

/V 

a 

A 

6 

£ 

V 

V 

V 

V 

V 

0 

1  .000022-0 

. 

1  .0056082 

1 .0000000 

1 

-0  .  620760  4 

0  .6921165 
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Eigenvalues  of  the  design  matrix  and  true  and 
(unconstrained)  estimates  of  Fourier  coefficients 
of  the  solution.  Example  3. 


Table  2. 


sufficiently  small  then  a  ,  6v  are  not  estimable  with  double  precision 

/\  A 

arithmetic  and  it  is  seen  that  and  3^  are  0  (to  as  many  figures  as 
we  have  printed).  Although  XX'  is  theoretically  of  full  rank  (64), 
the  40th  largest  eigenvalue  is  around  10  ^  times  the  largest. 

From  the  examples  we  have  studied,  it  appears  that  the  imposition 
of  positivity  constraints  can  be  an  important  source  of  information  in 
very  ill  posed  problems,  and  that  the  GCV  estimate  for  X  for  constrained 
problems,  and  its  approximate  version  appear  to  do  a  good  job  of  estimating 
X.  Of  course  not  all  problems  will  show  such  a  dramatic  improvement, 
with  the  imposition  of  constraints,  since,  if  no  constraints  are  active, 
then  no  information  has  been  added.  In  some  sense  the  examples  tried 
here  were  chosen  in  anticipation  of  negative  unconstrained  solutions 
(and,  we  must  admit,  with  some  subjective  hunches  on  the  part  of  the 
author  concerning  the  type  of  problem  the  method  is  likely  to  do  well 
on) . 

c 

The  evaluation  of  V  (X)  required  n  +  1  calls  to  QUADPR  at  a  cost 
per  call  for  n  =  64  of  around  5  to  8  seconds  CPU  time  on  the  Madison 
UNIVAC  1110  while  the  computation  of  vapprox(*)  requires  one  such  call. 

It  is  possible  that  a  clever  search  procedure  utilizing  information  from 

c  c 

V(X)  or  vapproxU)  could  be  used  to  obtain  the  minimizer  of  !r(x)  with  a 
small  number  of  functional  evaluations,  particularly  with  an  improved 
quadratic  optimation  routine.  On  the  other  hand  the  minimizer  of 

C 

Vapprox  may  be  ade9uate  many  situations.  It  is  clear  that  both  the 
exact  and  the  approximate  GCV  method  warrants  further  study,  both  theoretically 
and  numerically. 
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